Abstract

The neighbourhood function $N_G(t)$ of a graph $G$ gives, for each $t \in \mathbf{N}$, the number of
pairs of nodes $\langle x, y \rangle$ such that $y$ is reachable from $x$ in at most $t$ hops. The neighbourhood
function provides a wealth of information about the graph [PGF02] (e.g., it easily allows one
to compute its diameter), but it is very expensive to compute exactly. Recently, the ANF
algorithm [PGF02] (approximate neighbourhood function) has been proposed with the purpose
of approximating $N_G(t)$ on large graphs. We describe a breakthrough improvement over ANF
in terms of speed and scalability. Our algorithm, called HyperANF, uses the new HyperLogLog 
counters [FFGM07] and combines them efficiently through broadword programming [Knu07]; 
our implementation uses task decomposition to exploit multi-core parallelism. With HyperANF, 
for the first time we can compute in a few hours the neighbourhood function of graphs with 
billions of nodes with a small error and good confidence using a standard workstation. 

Then, we turn to the study of the distribution of distances between reachable nodes (that 
can be efficiently approximated by means of HyperANF), and discover the surprising fact that 
its index of dispersion provides a clear-cut characterisation of proper social networks vs. web 
graphs. We thus propose the spid (Shortest-Paths Index of Dispersion) of a graph as a new,
informative statistic that is able to discriminate between the above two types of graphs. We
believe this is the first proposal of a significant new non-local structural index for complex 
networks whose computation is highly scalable. 

1 Introduction 

The neighbourhood function $N_G(t)$ of a graph returns for each $t \in \mathbf{N}$ the number of pairs of nodes
$\langle x, y \rangle$ such that $y$ is reachable from $x$ in at most $t$ steps. It provides data about how fast the
"average ball" around each node expands. From the neighbourhood function, several interesting 
features of a graph can be estimated, and in this paper we are in particular interested in the effective 
diameter, a measure of the "typical" distance between nodes. 

Palmer, Gibbons and Faloutsos [PGF02] proposed an algorithm to approximate the neighbour- 
hood function (see their paper for a review of previous attempts at approximate evaluation); the 
authors distribute an associated tool, snap, which can approximate the neighbourhood function of 
medium-sized graphs. The algorithm keeps track of the number of nodes reachable from each node 
using Flajolet-Martin counters, a kind of sketch that makes it possible to compute the number of 
distinct elements of a stream in very little space. A key observation was that counters associated 
to different streams can be quickly combined into a single counter associated to the concatenation 
of the original streams. 

In this paper, we describe HyperANF — a breakthrough improvement over ANF in terms of 
speed and scalability. HyperANF uses the new HyperLogLog counters [FFGM07], and combines 
them efficiently by means of broadword programming [Knu07]. Each counter is made of a number
of registers, and the number of registers depends only on the required precision. The size of each
register is doubly logarithmic in the number of nodes of the graph, so HyperANF, for a fixed precision,
scales almost linearly in memory (i.e., $O(n \log\log n)$). By contrast, ANF's memory requirement
is $O(n \log n)$.

Using HyperANF, for the first time we can compute in a few hours the neighbourhood function
of graphs with more than one billion nodes with a small error and good confidence using a standard
workstation with 128 GB of RAM. Our algorithms are implemented in a tool distributed as free
software within the WebGraph framework.$^1$

Armed with our tool, we study several datasets, spanning from small social networks to very 
large web graphs. We isolate a statistically defined feature, the index of dispersion of the distance 
distribution, and show that it is able to tell "proper" social networks from web graphs in a natural 
way. 

2 Related work 

HyperANF is an evolution of ANF [PGF02], which is implemented by the tool snap. We will give 
some timing comparison with snap, but we can only do it for relatively small networks, as the large 
memory footprint of snap precludes application to large graphs. 

Recently, a MapReduce-based distributed implementation of ANF called HADI [KTA+10] has 
been presented. HADI runs on one of the fifty largest supercomputers — the Hadoop cluster M45. 
The only published data about HADI's performance is the computation of the neighbourhood
function of a Kronecker graph with 2 billion links, which required half an hour using 90 machines. 
HyperANF can compute the same function in less than fifteen minutes on a laptop. 

The rather complete survey of related literature in [KTA+10] shows that essentially no data 
mining tool was able before ANF to approximate the neighbourhood function of very large graphs 
reliably. A remarkable exception is Cohen's work [Coh97], which provides strong theoretical guar- 
antees but experimentally turns out to be not as scalable as the ANF approach; it is worth noting, 
though, that one of the proposed applications of [Coh97] (On-line estimation of weights of growing 
sets) is structurally identical to ANF. 

All other results published before ANF relied on a small number of breadth-first visits on 
uniformly sampled nodes — a process that has no provable statistical accuracy or precision. Thus, 
in the rest of the paper we will compare experimental data with snap and with the published data 
about HADI. 

3 HyperANF 

In this section, we present the HyperANF algorithm for computing an approximation of the neigh- 
bourhood function of a graph; we start by recalling from [FFGM07] the notion of HyperLogLog 
counter upon which our algorithm relies. We then describe the algorithm, discuss how it can be 
implemented to be run quickly using broadword programming and task decomposition, and give 
results about its memory requirements and precision. 

3.1 HyperLogLog counters 

HyperLogLog counters, as described in [FFGM07] (which is based on [DF03]), are used to count 
approximately the number of distinct elements in a stream. For the purposes of the present paper, 
we need to recall briefly their behaviour. Essentially, these probabilistic counters are a sort of 
approximate set representation to which, however, we are only allowed to pose questions about the 
(approximate) size of the set.$^2$

Let $\mathcal{D}$ be a fixed domain and $h : \mathcal{D} \to 2^\infty$ be a hash function mapping each element of $\mathcal{D}$ into
an infinite binary sequence. The function is fixed with the only assumption that "bits of hashed
values are assumed to be independent and to have each probability $\frac{1}{2}$ of occurring" [FFGM07].

For a given $x \in \mathcal{D}$, let $h_t(x)$ denote the sequence made by the leftmost $t$ bits of $h(x)$, and
$h^t(x)$ be the sequence of remaining bits of $h(x)$; $h_t$ is identified with its corresponding integer value in
the range $\{0, 1, \dots, 2^t - 1\}$. Moreover, given a binary sequence $w$, we let $\rho^+(w)$ be the number of
leading zeroes in $w$ plus one$^3$ (e.g., $\rho^+(00101) = 3$). Unless otherwise specified, all logarithms are
in base 2.

1 See [BV04]. http://webgraph.dsi.unimi.it/.

2 We remark that in principle $O(\log n)$ bits are necessary to estimate the number of unique elements in a
stream [AMS99]. HyperLogLog is a practical counter that starts from the assumption that a hash function can
be used to turn a stream into an idealised multiset (see [FFGM07]).

3 We remark that in the original HyperLogLog papers $\rho$ is used to denote $\rho^+$, but $\rho$ is a somewhat standard
notation for the ruler function [Knu07].



Algorithm 1 The HyperLogLog counter as described in [FFGM07]: it allows one to count (approximately)
the number of distinct elements in a stream. $\alpha_m$ is a constant whose value depends on $m$
and is provided in [FFGM07]. Some technical details have been simplified.

    h : D → 2^∞, a hash function from the domain of items
    M[−] the counter, an array of m = 2^b registers
         (indexed from 0) and set to −∞

    function add(M: counter, x: item)
    begin
        i ← h_b(x);
        M[i] ← max{M[i], ρ+(h^b(x))}
    end; // function add

    function size(M: counter)
    begin
        Z ← (Σ_{j=0}^{m−1} 2^{−M[j]})^{−1};
        return E = α_m m^2 Z
    end; // function size

    foreach item x seen in the stream begin
        add(M, x)
    end;
    print size(M)



The value $E$ printed by Algorithm 1 is [FFGM07][Theorem 1] an asymptotically almost unbiased
estimator for the number $n$ of distinct elements in the stream; for $n \to \infty$, the relative standard
deviation (that is, the ratio between the standard deviation of $E$ and $n$) is at most
$\beta_m/\sqrt{m} \le 1.06/\sqrt{m}$, where $\beta_m$ is a suitable constant (given in [FFGM07]). Moreover [DF03], even if the size
of the registers (and of the hash function) used by the algorithm is unbounded, one can limit it to
$\log\log(n/m) + \omega(n)$ bits obtaining almost certainly the same output ($\omega(n)$ is a function going to
infinity arbitrarily slowly); overall, the algorithm requires $(1 + o(1)) \cdot m \log\log(n/m)$ bits of space
(this is the reason why these counters are called HyperLogLog). Here and in the rest of the paper
we tacitly assume that $m \ge 64$ and that registers are made of $\lceil \log\log n \rceil$ bits.
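As a rough illustration of Algorithm 1, the following Python sketch implements a small HyperLogLog counter. Several choices are assumptions of this sketch and not of the paper: the class name `HyperLogLog`, the SHA-256-based 64-bit hash standing in for the infinite bit sequence, the use of 0 (instead of $-\infty$) to mark an empty register, the closed form $\alpha_m \approx 0.7213/(1 + 1.079/m)$ that [FFGM07] gives for $m \ge 128$, and the small-range (linear counting) correction from the practical version of the algorithm in [FFGM07].

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog counter with m = 2**b registers (sketch assumes b >= 7)."""

    HASH_BITS = 64  # assumption: a 64-bit hash stands in for the infinite sequence

    def __init__(self, b=8):
        self.b = b
        self.m = 1 << b
        # Assumption: 0 marks an empty register (Algorithm 1 uses minus infinity).
        self.registers = [0] * self.m
        # Closed form for alpha_m, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def _hash(self, item):
        digest = hashlib.sha256(repr(item).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item):
        h = self._hash(item)
        rest_bits = self.HASH_BITS - self.b
        i = h >> rest_bits                        # h_b(x): leftmost b bits pick the register
        rest = h & ((1 << rest_bits) - 1)         # h^b(x): the remaining bits
        rho_plus = rest_bits - rest.bit_length() + 1  # leading zeroes plus one
        self.registers[i] = max(self.registers[i], rho_plus)

    def size(self):
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m * z
        empty = self.registers.count(0)
        if estimate <= 2.5 * self.m and empty > 0:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / empty)
        return estimate


counter = HyperLogLog(b=8)
for x in range(100_000):
    counter.add(x % 50_000)        # every distinct value is seen twice
print(round(counter.size()))       # an estimate of the 50 000 distinct values
```

With $b = 8$ the counter uses 256 registers, so the guaranteed relative standard deviation is about $1.06/\sqrt{256} \approx 6.6\%$; duplicates in the stream do not affect the registers.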

3.2 The HyperANF algorithm

The approximate neighbourhood function algorithm described in [PGF02] is based on the observa- 
tion that B(x,r), the ball of radius r around node x, satisfies 

$$B(x, r) = \bigcup_{x \to y} B(y, r - 1).$$

Since B(x, 0) = { x }, we can compute each B(x,r) incrementally using sequential scans of the graph 
(i.e., scans in which we go in turn through the successor list of each node). The obvious problem is 
that during the scan we need to access randomly the sets $B(x, r-1)$ (the sets $B(x, r)$ can be just
saved on disk in an update file and reloaded later). Here probabilistic counters come into play; to
be able to use them, though, we need to endow counters with a primitive for the union. Union can
be implemented provided that the counter associated to the stream of data $AB$ can be computed
from the counters associated to $A$ and $B$; in the case of HyperLogLog counters, this is easily seen
to correspond to maximising the two counters, register by register.

The observations above result in Algorithm 2: the algorithm keeps one HyperLogLog counter
for each node; at the $t$-th iteration of the main loop, the counter $c[v]$ is in the same state as if it
had been fed with $B(v,t)$, and so its expected value is $|B(v,t)|$. As a result, the sum of all
$c[v]$'s is an (almost) unbiased estimator of $N_G(t)$ (for a precise statement, see Theorem 1).
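The iteration just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is a dictionary mapping every node to the list of its successors (each successor must itself be a key), counters are plain lists of registers kept in memory instead of being written to an update file, and the hash function, the names `singleton`, `union`, `size` and `hyper_anf`, and the small-range correction are choices of this sketch, not of the paper.

```python
import hashlib
import math

B = 8                                  # log2 of the registers per counter
M = 1 << B
ALPHA = 0.7213 / (1 + 1.079 / M)       # alpha_m for m >= 128 [FFGM07]
REST_BITS = 64 - B                     # assumption: 64-bit hash values


def singleton(node):
    """Registers of a counter that has seen only `node`."""
    digest = hashlib.sha256(repr(node).encode()).digest()
    h = int.from_bytes(digest[:8], "big")
    rest = h & ((1 << REST_BITS) - 1)
    registers = [0] * M                # 0 marks an empty register
    registers[h >> REST_BITS] = REST_BITS - rest.bit_length() + 1
    return registers


def union(a, b):
    """Register-by-register maximum of two counters."""
    return [max(x, y) for x, y in zip(a, b)]


def size(registers):
    estimate = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    empty = registers.count(0)
    if estimate <= 2.5 * M and empty > 0:
        return M * math.log(M / empty)  # small-range correction [FFGM07]
    return estimate


def hyper_anf(successors):
    """Estimates [N_G(0), N_G(1), ...] for a dict node -> list of successors."""
    c = {v: singleton(v) for v in successors}
    estimates = []
    while True:
        estimates.append(sum(size(c[v]) for v in c))
        updated = {}
        for v, succ in successors.items():
            m = c[v]
            for w in succ:
                m = union(c[w], m)
            updated[v] = m              # plays the role of the update file
        if updated == c:                # no counter changed: stabilisation
            return estimates
        c = updated


path = {i: [i + 1] for i in range(9)}
path[9] = []
print([round(s, 1) for s in hyper_anf(path)])
```

On the ten-node directed path used in the example the exact values grow from $N_G(0) = 10$ to $N_G(9) = 55$; the sketch returns estimates of these values and stops when an iteration leaves every counter unchanged.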



Algorithm 2 The basic HyperANF algorithm in pseudocode. The algorithm uses, for each node
$i \in n$, an initially empty HyperLogLog counter $c_i$. The function union(−, −) maximises two counters
register by register.

    c[−], an array of n HyperLogLog counters

    function union(M: counter, N: counter)
        foreach i < m begin
            M[i] ← max(M[i], N[i])
        end
    end; // function union

    foreach v ∈ n begin
        add v to c[v]
    end;
    t ← 0;
    do begin
        s ← Σ_v size(c[v]);
        Print s (the neighbourhood function N_G(t))
        foreach v ∈ n begin
            m ← c[v];
            foreach v → w begin
                m ← union(c[w], m)
            end;
            write (v, m) to disk
        end;
        Read the pairs (v, m) and update the array c[−]
        t ← t + 1
    until no counter changes its value.

We remark that the only sound way of running HyperANF (or ANF) is to wait for all counters to
stabilise (i.e., the last iteration must leave all counters unchanged). As we will see, any alternative
termination condition may lead to arbitrarily large mistakes on pathological graphs.$^4$

4 We remark that snap uses a threshold over the relative increment in the number of reachable pairs as a termination
condition, but this trick makes the tail of the function unreliable.

3.3 HyperANF at hyper speed

Up to now, HyperANF has been described just as ANF with HyperLogLog counters. The effect
of this change is an exponential reduction in the memory footprint and, consequently, in memory
access time. We now describe the algorithmic and engineering ideas that made HyperANF
much faster, actually so fast that it is possible to run it up to stabilisation.

Union via broadword programming. Given two HyperLogLog counters that have been set by
streams $A$ and $B$, the counter associated to the stream $AB$ can be built by maximising in parallel
the registers of each counter. That is, the register $i$ of the new counter is given by the maximum
of the $i$-th register of the first counter and the $i$-th register of the second counter.

Each time we scan a successor list, we need to maximise a large number of registers and store
the resulting counter. The immediate way of obtaining this result requires extracting the value of
each register, maximising it with the other corresponding registers, and writing the result to a
temporary counter. This process is extremely slow, as registers are packed in 64-bit memory words.
In the case of Flajolet-Martin counters, the problem is easily solved by computing the logical OR
of the words containing the registers. In our case, we resort to broadword programming techniques.
If the machine word has $w$ bits, we assume that at least $w$ registers are allocated to each counter, so each
set of registers is word-aligned.

Let $\gg$ and $\ll$ denote right and left (zero-filled) shifting; $\&$, $|$ and $\oplus$ denote bit-by-bit and,
or, and xor; $\bar{x}$ denotes the bit-by-bit complement of $x$.

We use $L_k$ to denote the constant whose ones are in position $0, k, 2k, \dots$, that is, the constant
with the lowest bit of each $k$-bit subword set (e.g., $L_8$ = 0x0101010101010101). We use $H_k$
to denote $L_k \ll k - 1$, that is, the constant with the highest bit of each $k$-bit subword set (e.g.,
$H_8$ = 0x8080808080808080).

It is known (see [Knu07], or [Vig08] for an elementary proof) that the following expression

$$x <^u y := \Bigl(\bigl(\bigl((x \mid H_k) - (y \mathbin{\&} \bar{H}_k)\bigr) \mid x \oplus y\bigr) \oplus (x \mid \bar{y})\Bigr) \mathbin{\&} H_k$$

performs a parallel unsigned comparison $k$-by-$k$-bit-wise. At the end of the computation, the highest
bit of each block of $k$ bits will be set iff the corresponding comparison is true (i.e., the value of the
block in $x$ is strictly smaller than the value of the block in $y$).

Once we have computed $x <^u y$, we generate a mask that is made entirely of 1s, or of 0s, for each
$k$-bit block, depending on whether we should select the value of $x$ or $y$ for that block:

$$m = \Bigl(\bigl(\bigl(((x <^u y) \gg k - 1) \mid H_k\bigr) - L_k\bigr) \mid H_k\Bigr) \oplus (x <^u y).$$

This formula works by moving the high bit denoting the result of the comparison to the least
significant bit (of each $k$-bit block). Then, we or with $H_k$ and subtract 1 from each block, obtaining
either a mask with just the high bit set (if we were starting from 1) or a mask with all bits set
except for the high bit (if we were starting from 0). The last two operations fix those values so that
they become $00 \cdots 0$ or $11 \cdots 1$. The result of the maximisation process is now just $x \mathbin{\&} m \mid y \mathbin{\&} \bar{m}$.

This discussion assumed that the set of registers of a counter is stored in a single machine word.
In a realistic setting, the registers are spread among several consecutive words, and we use multiple
precision subtractions and shifts to apply the expressions above on a sequence of words. All other
(logical) operations have just to be applied to each word in sequence.

All in all, by using the techniques above we can improve the speed of maximisation by a factor
of $w/k$, which in our case is about 13 (for graphs of up to $2^{32}$ nodes). This actually results in a
sixfold speed improvement of the overall application in typical cases (e.g., web graphs and $b = 8$),
as about 90% of the computation time is spent in maximisation.

Parallelisation via task decomposition. Although HyperANF is written as a sequential algorithm,
the outer loop lends itself to parallel execution, which can be extremely fruitful on a
modern multicore architecture; in particular, we approach this idea using task decomposition. We
divide the iteration on the whole set of nodes into a set of small tasks (in the order of the thousands),
where each task consists of iterating on a contiguous segment of nodes. A pool of threads
picks up the first available task and solves it: as a result, we obtain a performance improvement
that is linear in the number of cores. Threads can be designed to be extremely agile, helped by
WebGraph's facilities which allow us to provide each thread with a lightweight copy of the graph
that shares the bitstream and associated information with all other threads.

Tracking modified counters. It is an easy observation that a counter $c$ that does not change
its value is not useful for the next step of the computation: all counters using $c$ during their
update would not change their value when maximising with $c$ (and we do not even need to write $c$
on disk). We thus keep track of modified counters and skip altogether the maximisation step with
unmodified ones. Since, as we already remarked, 90% of computation time is spent in maximisation,
this approach leads to a large speedup after the first phases of the computation, when most counters
are stabilised.

For the same reason, we keep track of the harmonic partial sums of small blocks (e.g., 64) of
counters. The amount of memory required is negligible, but if no counter in the block has been
modified, we can avoid a costly computation.

Systolic computation. HyperANF can be run in systolic mode. In this case, we also use the
transposed graph: whenever a counter changes, it signals back to its predecessors that at the next
round they could change their values. Now, at each iteration nodes that have not been signalled
are entirely skipped during the computation. Systolic computations are fundamental to get high-precision
runs, as they reduce the cost of an iteration to scanning only the arcs of the graph that
are actually moving information around. We switch to systolic computation when fewer than one
quarter of the counters change their values.
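Going back to the broadword union described earlier in this section, the comparison expression and the mask construction can be tried out directly on Python integers. The sketch below is an illustration only: the function names `pack`, `unpack` and `broadword_max` are hypothetical, and an arbitrary-precision integer holding `count` registers of `k` bits each stands in for the sequence of machine words (so the multiple precision subtractions and shifts mentioned above come for free).

```python
import random


def pack(values, k):
    """Pack a list of k-bit values into one integer, register 0 in the lowest bits."""
    word = 0
    for i, v in enumerate(values):
        word |= v << (i * k)
    return word


def unpack(word, k, count):
    mask = (1 << k) - 1
    return [(word >> (i * k)) & mask for i in range(count)]


def broadword_max(x, y, k, count):
    """Register-wise maximum of two packed sequences of `count` k-bit registers."""
    full = (1 << (k * count)) - 1
    low = sum(1 << (i * k) for i in range(count))   # L_k
    high = low << (k - 1)                           # H_k
    not_high = ~high & full
    not_y = ~y & full
    # Parallel unsigned comparison: high bit of a block is set iff x < y there.
    less = ((((x | high) - (y & not_high)) | (x ^ y)) ^ (x | not_y)) & high
    # Mask: all ones in blocks where x must be kept, all zeroes elsewhere.
    m = ((((less >> (k - 1)) | high) - low) | high) ^ less
    return (x & m) | (y & ~m & full)


rng = random.Random(0)
a = [rng.randrange(32) for _ in range(12)]
b = [rng.randrange(32) for _ in range(12)]
merged = unpack(broadword_max(pack(a, 5), pack(b, 5), 5, 12), 5, 12)
print(merged == [max(p, q) for p, q in zip(a, b)])
```

The example uses twelve 5-bit registers, the number that fits in 60 bits of a 64-bit word, which is consistent with the speed-up factor of about $w/k \approx 13$ quoted above.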



3.4 Correctness, errors and memory usage 

Very little has been published about the statistical behaviour of ANF. The statistical properties 
of approximate counters are well known, but the values of such counters for each node are highly 
dependent, and adding them in a large amount can in principle lead to an arbitrarily large variance. 
Thus, making precise statistical statements about the outcome of a computation of ANF or Hyper- 
ANF requires some care. The discussion in the following sections is based on HyperANF, but its 
results can be applied mutatis mutandis to ANF as well. 

Consider the output $\hat N_G(t)$ of Algorithm 2 at a fixed iteration $t$. We can see it as a random
variable

$$\hat N_G(t) = \sum_{i \in n} X_{i,t}$$

where$^5$ each $X_{i,t}$ is the HyperLogLog counter that counts nodes reached by node $i$ in $t$ steps; what
we want to prove in this section is a bound on the relative standard deviation of $\hat N_G(t)$ (such a
proof, albeit not difficult, is not provided in the papers about ANF). First observe that [FFGM07],
for a fixed number of registers $m$ per counter, the standard deviation of $X_{i,t}$ satisfies

$$\frac{\sqrt{\mathrm{Var}[X_{i,t}]}}{|B(i,t)|} \le \eta_m,$$

where $\eta_m$ is the guaranteed relative standard deviation of a HyperLogLog counter. Using the
subadditivity of standard deviation (i.e., if $A$ and $B$ have finite variance,
$\sqrt{\mathrm{Var}[A + B]} \le \sqrt{\mathrm{Var}[A]} + \sqrt{\mathrm{Var}[B]}$), we prove the following

Theorem 1 The output $\hat N_G(t)$ of Algorithm 2 at the $t$-th iteration is an asymptotically almost
unbiased estimator$^6$ of $N_G(t)$, that is

$$\frac{E[\hat N_G(t)]}{N_G(t)} = 1 + \delta_1(n) + o(1) \quad \text{for } n \to \infty,$$

where $\delta_1$ is the same as in [FFGM07][Theorem 1] (and $|\delta_1(x)| < 5 \cdot 10^{-5}$ as soon as $m \ge 16$).
Moreover, $\hat N_G(t)$ has the same relative standard deviation of the $X_i$'s, that is

$$\frac{\sqrt{\mathrm{Var}[\hat N_G(t)]}}{N_G(t)} \le \eta_m.$$

Proof. We have that $E[\hat N_G(t)] = E\bigl[\sum_{i \in n} X_{i,t}\bigr]$. By Theorem 1 of [FFGM07], $E[X_{i,t}] = |B(i,t)| \, (1 + \delta_1(n) + o(1))$,
hence the first statement. For the second result, we have:

$$\frac{\sqrt{\mathrm{Var}[\hat N_G(t)]}}{N_G(t)} \le \frac{\sum_{i \in n} \sqrt{\mathrm{Var}[X_{i,t}]}}{N_G(t)} \le \frac{\eta_m \sum_{i \in n} |B(i,t)|}{N_G(t)} = \eta_m.$$



5 Throughout this paper, we use von Neumann's notation $n = \{0, 1, \dots, n - 1\}$, so $i \in n$ means that $0 \le i < n$.

6 From now on, for the sake of readability we shall ignore the negligible bias on $\hat N_G(t)$ as an estimator for $N_G(t)$:
the other estimators that will appear later on will be qualified as "(almost) unbiased", where "almost" refers precisely
to the above mentioned negligible bias.

Since, as we recalled in Section 3.1, the relative standard deviation $\eta_m$ satisfies $\eta_m \le 1.06/\sqrt{m}$,
to get a specific value $\eta$ it is sufficient to choose $m \approx 1.12/\eta^2$; this assumption yields an overall
space requirement of about

$$\frac{1.12}{\eta^2} \, n \log\log n \text{ bits}$$

(here, we used the obvious upper bound $|B(i,t)| \le n$). For instance, to obtain a relative standard
deviation of 9.37% (in every iteration) on a graph of one billion nodes one needs 74.5 GB of main 
memory for the registers (for a comparison, snap would require 550 GB). Note that since we write 
to disk the new values of the registers, this is actually the only significant memory requirement (the 
graph can be kept on disk and mapped in memory, as it is scanned almost sequentially). 
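The memory figure quoted above can be reproduced with a short calculation. The sketch below assumes that the number of registers is rounded up to a power of two (as in Algorithm 1, where $m = 2^b$), that each register takes $\lceil \log\log n \rceil$ bits, and that "GB" is read as $2^{30}$ bytes; the function names are hypothetical.

```python
import math


def registers_needed(eta):
    """m ~ 1.12 / eta**2, rounded up to a power of two."""
    raw = 1.06 ** 2 / eta ** 2
    return 1 << math.ceil(math.log2(raw))


def register_bits(n):
    """Width of one register, ceil(log log n), with logarithms in base 2."""
    return math.ceil(math.log2(math.log2(n)))


def register_memory_bytes(n, eta):
    """Total size of the n counters, ignoring any padding."""
    return n * registers_needed(eta) * register_bits(n) / 8


n = 10 ** 9
print(registers_needed(0.0937))                       # registers per counter
print(register_bits(n))                               # bits per register
print(register_memory_bytes(n, 0.0937) / 2 ** 30)     # size in units of 2**30 bytes
```

For one billion nodes and a relative standard deviation of 9.37% this gives 128 registers of 5 bits per counter, that is, $8 \cdot 10^{10}$ bytes, or about 74.5 when expressed in units of $2^{30}$ bytes.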

Applying Chebyshev's inequality, we obtain the following: 



Corollary 1 For every $\varepsilon$,

$$\Pr\left[\frac{\hat N_G(t)}{N_G(t)} \in (1 - \varepsilon, 1 + \varepsilon)\right] \ge 1 - \frac{\eta_m^2}{\varepsilon^2}.$$

In [FFGM07] it is argued that the HyperLogLog error is approximately Gaussian; the counters,
however, are not statistically independent and in fact the overall error does not appear to be normally
distributed. Nonetheless, for every fixed $t$, the random variable $\hat N_G(t)$ seems to be unimodal
(for example, the average p-value of the Dip unimodality test [HH85] for the cnr-2000 dataset is
0.011), so we can apply the Vysochanskii-Petunin inequality [VP82], obtaining the bound

$$\Pr\left[\frac{\hat N_G(t)}{N_G(t)} \in (1 - \varepsilon, 1 + \varepsilon)\right] \ge 1 - \frac{4 \eta_m^2}{9 \varepsilon^2}.$$



In the rest of the paper, to state our theorems clearly we will always assume error $\varepsilon$ with confidence
$1 - \delta$. It is useful, as a practical reminder, to note that because of the above inequality for each point
of the neighbourhood function we can assume a relative error of $k\eta_m$ with confidence $1 - 4/(9k^2)$
(e.g., $2\eta_m$ with 90% confidence, or $3\eta_m$ with 95% confidence).
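The two bounds can be compared numerically by setting $\varepsilon = k\eta_m$. The helper names below are hypothetical; the check on $k$ reflects the standard hypothesis of the Vysochanskii-Petunin inequality, which requires $k > \sqrt{8/3}$.

```python
def chebyshev_confidence(k):
    """Lower bound on Pr[relative error < k * eta_m] from Corollary 1."""
    return 1 - 1 / k ** 2


def vp_confidence(k):
    """Same bound under unimodality; the inequality requires k > sqrt(8/3)."""
    if k * k <= 8 / 3:
        raise ValueError("the Vysochanskii-Petunin bound needs k > sqrt(8/3)")
    return 1 - 4 / (9 * k ** 2)


def eta(m):
    """Guaranteed bound 1.06 / sqrt(m) on the relative standard deviation."""
    return 1.06 / m ** 0.5


for k in (2, 3):
    print(k, chebyshev_confidence(k), vp_confidence(k))
print(eta(256))
```

For $k = 2$ the unimodal bound gives $1 - 4/36 \approx 0.889$ against $0.75$ from Chebyshev's inequality, and for $k = 3$ it gives $1 - 4/81 \approx 0.951$ against about $0.889$; these are the values rounded to 90% and 95% in the text.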

As an empirical counterpart to the previous results, we considered a relatively small graph of
about 325 000 nodes (cnr-2000, see Section 6 for a full description) for which we can compute the
exact neighbourhood function $N_G(-)$; we ran HyperANF 500 times with $m = 256$. At least 96%
of the samples (for all $t$) have a relative error smaller than twice the theoretical relative standard
deviation 6.62%. The percentage jumps up to 100% for three times the relative standard deviation,
showing that the distribution of the values behaves better than what the theory would guarantee.



4 Deriving useful data 

As advocated in [PGF02], being able to estimate the neighbourhood function on real-world networks
has several interesting applications. Unfortunately, all published results we are aware of lack
statistical satellite data (such as confidence intervals, or distribution of the computed values) that 
make it possible to compare results from different research groups. Thus, in this section we try to 
discuss in detail how to derive useful data from an approximation of the neighbourhood function. 

The distance cdf. We start from the apparently easy task of computing the cumulative distribution
function of distances of the graph $G$ (in short, distance cdf), which is the function $H_G(t)$ that gives
the fraction of reachable pairs at distance at most $t$, that is,

$$H_G(t) = \frac{N_G(t)}{\max_t N_G(t)}.$$

In other words, given an exact computation of the neighbourhood function, the distance cdf can be
easily obtained by dividing all values by the largest one. Being able to estimate $N_G(t)$ allows one
to produce a reliable approximation of the distance cdf:

Theorem 2 Assume $\hat N_G(t)$ is known for each $t$ with error $\varepsilon$ and confidence $1 - \delta$, that is

$$\Pr\left[\frac{\hat N_G(t)}{N_G(t)} \in (1 - \varepsilon, 1 + \varepsilon)\right] \ge 1 - \delta.$$

Let $\hat H_G(t) = \hat N_G(t) / \max_t \hat N_G(t)$. Then $\hat H_G(t)$ is an (almost) unbiased estimator for $H_G(t)$; moreover,
for a fixed sequence $t_0, t_1, \dots, t_{k-1}$, for every $\varepsilon$ and all $0 \le i < k$ we have that $\hat H_G(t_i)$ is
known with error $2\varepsilon$ and confidence $1 - (k + 1)\delta$, that is,

$$\Pr\left[\bigwedge_{i \in k} \frac{\hat H_G(t_i)}{H_G(t_i)} \in (1 - 2\varepsilon, 1 + 2\varepsilon)\right] \ge 1 - (k + 1)\delta.$$



Proof. Note that if

$$1 - \varepsilon \le \hat N_G(t) / N_G(t) \le 1 + \varepsilon$$

holds for every $t$, then a fortiori

$$1 - \varepsilon \le \max_t \hat N_G(t) / \max_t N_G(t) \le 1 + \varepsilon$$

(because, although the maxima might be first attained at different values of $t$, the same holds for
any larger values). As a consequence,

$$1 - 2\varepsilon \le \frac{1 - \varepsilon}{1 + \varepsilon} \le \frac{\hat H_G(t)}{H_G(t)} \le \frac{1 + \varepsilon}{1 - \varepsilon} \le 1 + 2\varepsilon.$$

The probability $1 - (k + 1)\delta$ is immediate from the union bound, as we are considering $k + 1$ events
at the same time.



Note two significant limitations: first of all, making precise statements (i.e., with confidence) about
all points of $H_G(t)$ requires a very high initial error and confidence. Second, the theorem holds if
HyperANF has been run up to stabilisation, so that the probabilistic guarantees of HyperLogLog
hold for all $t$.

The first limitation makes it impossible in practice to get directly sensible confidence intervals, for
instance, for the average distance or higher moments of the distribution (we will elaborate further
on this point later). Thus, only statements about a small, finite number of points can be approached
directly.

The second limitation is somewhat more serious in theory, albeit in practice it can be circumvented
by making suitable assumptions about the graph under examination (which however should be
clearly stated along with the data). Consider the graph $G$ made by two $k$-cliques joined by a unidirectional
path of $\ell$ nodes (see Figure 2). Even neglecting the effect of approximation, $G$ can "fool"
HyperANF (or ANF) so that the distance cdf is completely wrong (see Figure 1) when using any
stopping criterion that is not stabilisation.

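Indeed, the exact neighbourhood function of $G$ is given by:

$$N_G(t) = \begin{cases} 2k + \ell & \text{if } t = 0 \\ (t + 1)\left(2k + \ell - \frac{t}{2}\right) - 2k + 2k^2 & \text{if } 1 \le t \le \ell \\ (\ell + 1)\left(2k + \frac{\ell}{2}\right) - 2k + 3k^2 & \text{if } \ell < t. \end{cases}$$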



The key observation is that the very last value is significantly larger than all previous values, as
at the last step the nodes of the right clique become reachable from the nodes of the first clique.
Thus, if iteration stops before stabilisation,$^7$ the normalisation factor used to compute the cdf will
be smaller by $\approx k^2$ than the actual value, causing a completely wrong estimation of the cdf, as
shown in Figure 1.
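The closed form can be checked against a brute-force computation on small instances. The sketch below builds the graph under an assumption about how the path is attached (every node of the first clique has an arc to the first path node, and the last path node has an arc to every node of the second clique), which is the wiring that reproduces the expression for $N_G(t)$ given above; it requires $\ell \ge 1$, and all function names are hypothetical.

```python
from collections import deque


def two_cliques_graph(k, ell):
    """Successor lists: left clique, then the path, then the right clique."""
    n = 2 * k + ell
    left, path, right = range(k), range(k, k + ell), range(k + ell, n)
    succ = {}
    for u in left:
        succ[u] = [v for v in left if v != u] + [path[0]]
    for i, p in enumerate(path):
        succ[p] = [path[i + 1]] if i + 1 < ell else list(right)
    for u in right:
        succ[u] = [v for v in right if v != u]
    return succ


def neighbourhood_function(succ):
    """Exact N_G(t) for t = 0, 1, ..., computed with one BFS per node."""
    pairs_at = {}
    for source in succ:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            pairs_at[d] = pairs_at.get(d, 0) + 1
    total, result = 0, []
    for t in range(max(pairs_at) + 1):
        total += pairs_at.get(t, 0)
        result.append(total)
    return result


def closed_form(k, ell, t):
    if t == 0:
        return 2 * k + ell
    if t <= ell:
        return (t + 1) * (2 * k + ell - t / 2) - 2 * k + 2 * k * k
    return (ell + 1) * (2 * k + ell / 2) - 2 * k + 3 * k * k


k, ell = 4, 3
exact = neighbourhood_function(two_cliques_graph(k, ell))
print(exact == [closed_form(k, ell, t) for t in range(ell + 2)])
```

With the parameters of Figure 1 ($\ell = 10$, $k = 260$) the closed form gives $N_G(1) = 135\,739$, $N_G(10) = 140\,455$ and $N_G(11) = 208\,055$: normalising by the value at $t = 10$ puts more than 96% of the pairs at distance at most 1, whereas the true cdf at $t = 10$ is only about 0.675.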



7 We remark that stabilisation can occur, in principle, even before the last step because of hash collisions in
HyperLogLog counters, but this will happen with a controlled probability.

Figure 1: The real cdf of the graph in Figure 2 (+), and the one that would be computed using any
termination condition that is not stabilisation (*); here $\ell = 10$ and $k = 260$.

Figure 2: Two $k$-cliques joined by a unidirectional path of $\ell$ nodes: terminating even one step earlier
than stabilisation completely miscalculates the distance cdf (see Figure 1); the effective diameter
is $\ell + 1$, but terminating even just one step earlier than stabilisation yields an estimated effective
diameter of 1.



Although this counterexample (which can be easily adapted to be symmetric) is definitely pathological,
it suggests that particular care should be taken when handling graphs that present narrow
"tubes" connecting large connected components: in such scenarios, the function $N_G(t)$ exhibits relatively
long plateaux (preceded and followed by sharp bumps) that may fool the computation of
the cdf.

The effective diameter. The first application of ANF was the computation of the effective
diameter. The effective diameter of $G$ at $\alpha$ is the smallest $t_0$ such that $H_G(t_0) \ge \alpha$; when $\alpha$ is
omitted, it is assumed to be $\alpha = .9$.$^8$ The interpolated effective diameter is obtained in the same
way on the linear interpolation of the points of the neighbourhood function.

Since the function $\hat H_G(t)$ is necessarily monotone in $t$ (independently of the approximation
error), from Theorem 2 we obtain:

Corollary 2 Assume $\hat N_G(t)$ is known for each $t$ with error $\varepsilon$ and confidence $1 - \delta$, and there are
points $s$ and $t$ such that

$$\frac{\hat H_G(s)}{1 - 2\varepsilon} \le \alpha \le \frac{\hat H_G(t)}{1 + 2\varepsilon}.$$

Then, with probability $1 - 3\delta$ the effective diameter at $\alpha$ lies in $[s \,..\, t]$.

Unfortunately, since the effective diameter depends sensitively on the distance cdf, again termination
conditions can produce arbitrary errors. Getting back to the example of Figure 2, with a sufficiently
large $k$, for example $k = 2\ell^2 + 5\ell + 2$, the effective diameter is $\ell + 1$, which would be correctly output
after $\ell + 1$ iterations, whereas even stopping one step earlier (i.e., with $t = \ell$) would produce 1 as
output, yielding an arbitrarily large error. snap, indeed, fails to produce the correct result on
this graph, because it stops iterating whenever the ratio between two successive iterates of $\hat N_G$ is
sufficiently close to 1.

8 The actual diameter of $G$ is its effective diameter at 1, albeit the latter is defined for all graphs whereas the
former makes sense only in the strongly connected case.

Algorithm 3 Computing the effective diameter at $\alpha$ of a graph $G$; Algorithm 2 is used to compute
$\hat N_G$.

    foreach t = 0, 1, . . . begin
        compute N̂_G(t) (error ε, confidence 1 − δ)
        if (some termination condition holds) break
    end;
    M ← max N̂_G(t)
    find the largest D− such that N̂_G(D−)/M ≤ α(1 − 2ε)
    find the smallest D+ such that N̂_G(D+)/M ≥ α(1 + 2ε)
    output [D− . . D+] with confidence 1 − 3δ



Algorithm 3 is used to estimate the effective diameter of a graph; albeit this approach is rea- 
sonable (and actually it is similar to that adopted by snap, although the latter does not provide 
any confidence interval), unless the neighbourhood function is known with very high precision it is 
almost impossible to obtain good upper bounds, because of the typical flatness of the distance cdf 
after the 90th percentile. Moreover, results computed using a termination condition different from 
stabilisation should always be taken with a grain of salt because of the discussion above. 

The distance density function. The situation, from a theoretical viewpoint, is somehow even
worse when we consider the density function $h_G(-)$ associated to the cdf $H_G(-)$. Controlling the
error on $h_G(-)$ is not easy:

Lemma 1 Assume that, for a given $t$, $\hat H_G(t)$ is an estimator of $H_G(t)$ with error $\varepsilon$ and confidence
$1 - \delta$. Then $\hat h_G(t) = h_G(t) \pm 2\varepsilon$ with confidence $1 - 2\delta$.

Proof. With confidence $1 - 2\delta$,

$$\hat h_G(t) = \hat H_G(t) - \hat H_G(t - 1) \le (1 + \varepsilon) H_G(t) - (1 - \varepsilon) H_G(t - 1) \le h_G(t) + 2\varepsilon,$$

and similarly $\hat h_G(t) \ge h_G(t) - 2\varepsilon$. ∎

Note that the bound is very weak: since our best generic lower bound is $h_G(t) \ge 1/n^2$, the relative
error with which we know a point $h_G(t)$ is $2\varepsilon n^2$ (which, of course, is pretty useless).

Moments. Evaluation of the moments of $h_G(-)$ poses further problems. Actually, by Lemma 1
we can deduce that

$$\sum_t t\,h_G(t) - 2\epsilon D_G^2 \le \sum_t t\,\hat h_G(t) \le \sum_t t\,h_G(t) + 2\epsilon D_G^2$$

with confidence $1 - 2D_G\delta$, where $D_G$ is the diameter of G, which implies that the expected value of
$\hat h_G(-)$ is an (almost) unbiased estimator of the expected value of $h_G(-)$. Nonetheless, the bounds
we obtain are horrible (and actually unusable).
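The step from Lemma 1 to a bound on the mean can be spelled out as follows. This is a sketch under two assumptions that are not stated in the text: that t ranges over $0, \dots, D_G$, and that $D_G \ge 1$, so that $\sum_t t = D_G(D_G+1)/2 \le D_G^2$.

```latex
\sum_{t=0}^{D_G} t\,\hat h_G(t)
  \le \sum_{t=0}^{D_G} t\,\bigl(h_G(t) + 2\epsilon\bigr)
  = \sum_{t=0}^{D_G} t\,h_G(t) + \epsilon\,D_G(D_G + 1)
  \le \sum_{t=0}^{D_G} t\,h_G(t) + 2\epsilon D_G^2 .
```

The lower bound follows in the same way from $\hat h_G(t) \ge h_G(t) - 2\epsilon$.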

The situation for the variance is even worse, as we have to prove that we can use $\mathrm{Var}[\hat h_G]$ as
an estimator of $\mathrm{Var}[h_G]$. Note that for a fixed graph G, $H_G$ is a precise distribution and $\mathrm{Var}[h_G]$
is an actual number. Conversely, $\hat h_G$ (and hence $\mathrm{Var}[\hat h_G]$) is a random variable⁹. By Theorem 2,
we know that $\hat H_G$ is an (almost) unbiased pointwise estimator for $H_G$, and that we can control
its concentration by suitably choosing the number m of counters. We are going to derive bounds
on the approximation of $\mathrm{Var}[h_G]$ using the values of $\hat H_G(t)$ up to $\hat D_G$ (i.e., the iteration at which
HyperANF stabilises):

⁹ More precisely, $\hat h_G$ is a sequence of (stochastically dependent) random variables $\hat h_G(0)$, $\hat h_G(1)$, ...

Lemma 2 Assume that, for every $0 \le t \le D_G$, $\hat H_G(t)$ is an estimator of $H_G(t)$ with error $\epsilon$ and
confidence $1 - \delta$; then, $\mathrm{Var}[\hat h_G]$ is an estimator of $\mathrm{Var}[h_G]$ with error

$$8\epsilon\,\frac{D_G^3}{\mathrm{Var}[h_G]} + 4\epsilon^2\,\frac{D_G^4}{\mathrm{Var}[h_G]}$$

and confidence $1 - (D_G + 1)\delta$.

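Proof. Assuming error $\epsilon$ on the values of $\hat H_G$ in $[0 \,..\, D_G]$ implies confidence $1 - (D_G + 1)\delta$. Since
$\hat D_G \le D_G < \infty$, and by definition $h_G(t) = 0$ for $t > D_G$, we have (t ranges in $[0 \,..\, D_G]$):

\begin{align*}
\mathrm{Var}[\hat h_G] &= \sum_t t^2\,\hat h_G(t) - \Bigl(\sum_t t\,\hat h_G(t)\Bigr)^2 \\
  &\le \sum_t t^2\,\bigl(h_G(t) + 2\epsilon\bigr) - \Bigl(\sum_t t\,h_G(t) - 2\epsilon\sum_t t\Bigr)^2 \\
  &\le \mathrm{Var}[h_G] + 2\epsilon\sum_t t^2 + 4\epsilon\,\mathrm{E}[h_G]\sum_t t \\
  &\le \mathrm{Var}[h_G] + 4\epsilon D_G^2\,\bigl(D_G + \mathrm{E}[h_G]\bigr) \\
  &\le \mathrm{Var}[h_G] + 8\epsilon D_G^3,
\end{align*}

where $\mathrm{E}[h_G]$ is the average path length. Similarly

$$\mathrm{Var}[\hat h_G] \ge \mathrm{Var}[h_G] - 8\epsilon D_G^3 - 4\epsilon^2 D_G^4.$$

Hence the statement. ∎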

The error and confidence we obtain are again unusable, but the lemma proves that with enough
precision and confidence on $\hat H_G(-)$ we can get precision and confidence on $\mathrm{Var}[\hat h_G]$.

The results in this section suggest that, if computations involve the moments, the only realistic
possibility is to resort to parametric statistics to study the behaviour of the value of interest on
a large number of samples. That is, it is better to compute a large number of relatively
low-precision approximate neighbourhood functions than a small number of high-precision ones, as
from the former the latter are easily computable by averaging, whereas it is impossible to obtain
a large number of samples of derived values from the latter. As we will see, this approach works
surprisingly well.
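A minimal Python sketch of this strategy, assuming that each low-precision run has already been reduced to one value of the derived statistic (for instance the average distance). The function name, the normal-approximation factor `z = 1.96` and the sample values are all hypothetical.

```python
import math
import statistics


def summarise_runs(values, z=1.96):
    """Summarise one derived value per run (hypothetical helper).

    Returns the sample mean, the sample standard deviation, and the
    half-width of a normal-approximation interval for the mean.
    """
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    half_width = z * std / math.sqrt(len(values))
    return mean, std, half_width


# Made-up values standing in for the statistic computed on four runs.
mean, std, half_width = summarise_runs([2.0, 4.0, 4.0, 6.0])
```

The interval is only meaningful if the values are approximately normally distributed, which has to be checked empirically for each statistic.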



5 SPID 

The main purpose of computing aggregated data such as the distance distribution is that we can try 
to define indices that express some structural property of the graph we study, an obvious example 
being the average distance, or the effective diameter. 

One of the main goals of our recent research has been finding a simple property that clearly
distinguishes between social networks deriving from human interaction (what is usually called a
social network in the strong or proper sense: DBLP, Facebook, etc.) and web-based graphs, which
share several properties of social networks and, like the latter, arise from human activity, but present
a visibly different structure.

In this paper we propose for the first time to use the index of dispersion $\sigma^2/\mu$ (a.k.a.
variance-to-mean ratio) of the distance distribution as a measure of the "webbiness" of a social network.
We call such an index the spid (shortest-paths index of dispersion)¹⁰ of G. In particular, networks
with a spid larger than one are to be considered "web-like", whereas networks with a spid smaller
than one are to be considered "properly social". We recall that a distribution is called under- or
over-dispersed depending on whether its index of dispersion is smaller or larger than 1, so a network
is properly social or not depending on whether its distance distribution is under- or over-dispersed.
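The definition can be made concrete with a short Python sketch that derives the spid from the values of a neighbourhood function. The function names are hypothetical, and the sketch assumes that the cdf is obtained by dividing by the maximum value and that pairs at distance 0 are part of the distribution.

```python
def distance_density(nf):
    """Density of distances from neighbourhood-function values nf[0..T].

    Assumes the cdf is nf[t] / max(nf) and that the density at t is the
    difference between consecutive cdf values (the cdf before 0 being 0).
    """
    m = max(nf)
    cdf = [v / m for v in nf]
    return [cdf[0]] + [cdf[t] - cdf[t - 1] for t in range(1, len(cdf))]


def spid(nf):
    """Index of dispersion (variance-to-mean ratio) of the distance density."""
    density = distance_density(nf)
    mean = sum(t * p for t, p in enumerate(density))
    second_moment = sum(t * t * p for t, p in enumerate(density))
    return (second_moment - mean ** 2) / mean


# Directed path 0 -> 1 -> 2: three pairs within 0 steps, five within 1, six within 2.
path_spid = spid([3, 5, 6])
```

For this toy path the mean distance is 2/3 and the variance is 5/9, so the spid is 5/6 and the distribution is under-dispersed.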

¹⁰ If we were to follow strictly the terminology used in this paper, this would be the index of dispersion of the
distance distribution, but we guessed that the acronym IDDD would not have been as successful.

Figure 3: Cumulative density function of 100 values of the spid computed using HyperANF on
cnr-2000. For comparison, we also plot random samples of size 100 and 10 000 drawn from a 
normal distribution. 



The intuition behind the spid is that "properly social" networks strongly favour short connections,
whereas in the web long connections are not uncommon: this intuition will be confirmed in
Section 6.

As discussed in the previous section, in theory estimating the spid is an impossible task, due to
the inherent difficulty of evaluating the moments of $h_G(-)$. In practice, however, the estimates of
the spid computed directly on runs of HyperANF are quite precise. From the actual neighbourhood
function computed for cnr-2000 we deduce that the graph spid is 2.49. We then ran HyperANF 100 times
with a relative standard deviation of 9.37%, computing for each run an estimate
of the spid; these values approximately follow a normal distribution of mean 2.489 and standard
deviation 0.9 (see Figure 3). We obtained analogous concentration results for the average distance.
In some pathological cases, the distribution is not Gaussian, although it always turns out to be unimodal
(in some cases, after discarding a few outliers), so we can apply the Vysochanskij-Petunin inequality. We
will report some relevant observations on the spid of a number of graphs after describing our
experiments.
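For reference, the Vysochanskij-Petunin inequality states that a unimodal random variable with mean $\mu$ and finite variance $\sigma^2$ satisfies $P(|X - \mu| \ge \lambda\sigma) \le 4/(9\lambda^2)$ for $\lambda > \sqrt{8/3}$. The Python sketch below turns it into an interval half-width; the function name and the default confidence level are hypothetical.

```python
import math


def vp_half_width(std, confidence=0.95):
    """Half-width lambda * std given by the Vysochanskij-Petunin inequality.

    A unimodal variable with standard deviation `std` lies within this
    distance of its mean with probability at least `confidence`.
    """
    tail = 1.0 - confidence
    lam = math.sqrt(4.0 / (9.0 * tail))
    if lam <= math.sqrt(8.0 / 3.0):
        # The bound 4 / (9 * lambda^2) only holds for lambda > sqrt(8/3),
        # that is, for confidence levels above 5/6.
        raise ValueError("confidence must be greater than 5/6")
    return lam * std


# About 2.98 standard deviations at 95% confidence.
factor = vp_half_width(1.0)
```

At 95% confidence the interval is about 2.98 standard deviations wide on each side, against 1.96 under a normality assumption.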



6 Experiments 

We ran our experiments on the datasets described in Table 2: 

• the web graphs are almost all available at http://law.dsi.unimi.it/, except for the altavista
dataset that was provided by Yahoo! within the Webscope program (AltaVista webpage
connectivity dataset, version 1.0, http://research.yahoo.com/Academic_Relations);¹¹

• for the social networks: hollywood (http://www.imdb.com/) is a co-actorship graph where
vertices represent actors; dblp (http://www.informatik.uni-trier.de/~ley/db/) is a
scientific collaboration network where each vertex represents a scientist and two vertices are
connected if they have worked together on an article; in ljournal (http://www.livejournal.com/)
nodes are users and there is an arc from x to y if x registered y among his friends (it is
not necessary to ask y's permission, so the graph is directed); amazon
(http://www.archive.org/details/amazon_similarity_isbn/) describes similarity among books as reported by
the Amazon store; enron is a partially anonymised corpus of e-mail messages exchanged by
some Enron employees (nodes represent people and there is an arc from x to y whenever y
was the recipient of a message sent by x); finally, in flickr (http://www.flickr.com/)¹²

¹¹ It should be remarked that this graph, albeit widely used in the literature, is not a good dataset. The dangling
nodes are 53.74%, an impossibly high value [Vig07] and an almost sure indication that all nodes in the frontier of
the crawler (and not only visited nodes) were added to the graph, and the giant component is less than 4% of the
whole graph.

¹² We thank Yahoo! for the experimental results on the Flickr graph.

Graph                                snap                  HyperANF
amazon                               9.5 m                 5 s
indochina-2004                       4.62 h                1.83 m
altavista                            -                     1.2 h

                                     HADI (90 machines)    HyperANF
Kronecker (177 K nodes, 2 B arcs)    30 m                  2.25 m

Table 1: A comparison of the speed of snap/HADI vs. HyperANF. The tests on snap were performed
on our hardware. Both algorithms were stopped at a relative increment of 0.001. The timings of 
HADI on the M45 cluster are the best reported in [KTA+10], and both algorithms ran three 
iterations. We remark that a run of HyperANF on the Kronecker graph takes less than fifteen 
minutes on a laptop. 



vertices correspond to Flickr users and there is an edge connecting x and y whenever either 
vertex is recorded as a contact of the other one. 

To the best of our knowledge, this is the first paper where such a wide and diverse set of data is
studied, and where features such as effective diameter or average path length are computed on very
large graphs with precise statistical guarantees.

All experiments are performed on a Linux server equipped with Intel Xeon X5660 CPUs (2.80 GHz,
12 MB cache size) for a total of 24 cores and 128 GB of RAM; the server cost about 8 900 EUR in
2010.

A brief comparison with snap and HADI timings is shown in Table 1. Essentially, on our
hardware HyperANF is two orders of magnitude faster than snap. Our run on the Kronecker graph
is one order of magnitude faster than HADI's (or three orders of magnitude faster, if you take into
consideration the number of machines involved), but this comparison is unfair, as in principle HADI
can scale to arbitrarily large graphs, whereas we are limited by the amount of memory available.
Nonetheless, the speedup is clearly a breakthrough in the analysis of large graphs. It would be
interesting to compare our timings for the altavista dataset with HADI's, but none have been
published.
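The orders of magnitude quoted above can be checked directly against the timings of Table 1. The following Python sketch does the arithmetic; the helper names and the dictionary layout are hypothetical, while the timings and the 90 machines are those reported in the table.

```python
SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600}


def to_seconds(value, unit):
    """Convert a timing such as (9.5, "m") into seconds."""
    return value * SECONDS_PER_UNIT[unit]


def speedup(baseline, hyperanf):
    """Ratio between the baseline timing and the HyperANF timing."""
    return to_seconds(*baseline) / to_seconds(*hyperanf)


# (baseline timing, HyperANF timing) as listed in Table 1.
timings = {
    "amazon (snap)": ((9.5, "m"), (5, "s")),
    "indochina-2004 (snap)": ((4.62, "h"), (1.83, "m")),
    "Kronecker (HADI)": ((30, "m"), (2.25, "m")),
}
ratios = {name: speedup(base, ours) for name, (base, ours) in timings.items()}
# Accounting for the 90 machines used by HADI.
kronecker_per_machine = ratios["Kronecker (HADI)"] * 90
```

The ratios are about 114 and 151 against snap and about 13 against HADI, which becomes 1200 once the 90 machines are taken into account.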

It is this speed that makes it possible, for the first time, to compute data associated with the 
distance distribution with high precision and for a large number of graphs. We have 100 runs 
with relative standard deviation of 9.37% for all graphs, except for those on the altavista dataset 
(13.25%). All graphs are run to stabilisation. Our computations are necessarily much longer 
(usually, an order of magnitude longer in iterations) than those used to compute the effective 
diameter or similar measures. This is due to the necessity of computing with high precision second- 
order statistics that are used to compute the spid. 

Previous publications used few graphs, mainly because of the large computational effort that
was necessary, and no data was available about the number of runs. Moreover, we give precise
confidence intervals based on parametric statistics for data depending on the second moment, such
as the spid, something that has never been done before. We gather here our findings.

A posteriori parameters are highly concentrated. According to our experiments, computing 
the effective diameter, average distance and spid on a large number of low-precision runs generates 
highly concentrated distributions (see the empirical standard deviation in Table 2). Thus, we 
suggest this approach for computing such values, provided that termination is by stabilisation. 

Figure 4: A plot showing the strong linear correlation between the average distance and the effective
diameter.



Effective diameter and average distance are essentially linearly correlated. Figure 4
shows a scatter plot of the two values, and the line 2x/3 + 1. The correlation between the two
values has always been folklore in the study of social networks, but we can confirm that on both
social and web networks the connection can be exactly expressed in linear terms (it would be of
course interesting to prove such a correlation formally, under suitable restrictions on the structure
of the graph). This fact suggests that the average distance (which is more principled from a statistical
viewpoint, and parameter-free) should be used as the reference parameter to express the closeness
between nodes. Moreover, experimentally the standard deviation of the effective diameter in a
posteriori computations is usually significantly larger than that of the average distance.
Incidentally, the average distance of the altavista dataset is 16.5, slightly more than what is
reported in [KTA+10] (possibly because of artifacts of the termination conditions).

It is difficult to give a priori confidence intervals for the effective diameter with a 
small number of runs. Unless a large number of runs is available, so that the precision of 
the approximation of the neighbourhood function can be significantly lowered, it is impossible to 
provide interesting upper bounds for the effective diameter. 

The spid can tell social networks from web graphs. As shown in Table 2, even taking the 
standard deviation into account spids are pretty much below 1 for social networks and above 1 
for web graphs; host graphs (not surprisingly) behave like social networks. Note that this works 
both for directed and undirected graphs. Figure 5 shows the spid values obtained for our datasets 
plotted against the graph size, and also witnesses that there is no correlation (a similar graph, not 
shown here, testifies that spid is also independent from density). Figure 6 shows that there is some 
slight correlation between the spid and the average distance: nonetheless, there is no way to tell 
networks from our dataset apart using the latter value, whereas the under- or over-dispersion of 
the distance distribution, as defined by the spid, never makes a mistake. Of course, we expect to 
enrich this graph in time with more datasets: we are particularly interested in gathering very large 
social networks to test the spid at large sizes. 

We remark that, as a sanity check, we have also computed on several web-graph datasets the 
spid of the giant component, which turned out to be very similar to the spid of the whole graph. We 
see this as a clear sign that the spid is largely independent of the artifacts of the crawling process. 

Direction should not be destroyed when analysing a graph. We confirm that symmetrising
graphs destroys the combinatorial structure of the network: the average distance drops to very low
values in all cases, as well as the spid. This suggests that there is important structural information
that is being ignored. We also note that since all web snapshots we have at hand are gathered by some
kind of breadth-first visit, they represent balls of small diameter centred at the seed: symmetrising
the graph, we cannot expect to get an average distance that is larger than twice the radius of the
ball. All in all, the only advantage of symmetrising a graph is a significant reduction in the number
of iterations that are needed to complete a computation of the neighbourhood function.¹³







Figure 5: A plot showing the spid values (vertical) for our datasets compared with their size (i.e., 
number of nodes, horizontal): red squares correspond to social networks, blue diamonds to web 
graphs and black circles to host graphs. 

To give a more direct idea of the level of precision of our diameter estimation, we computed
the actual effective diameter at $\alpha$ for the cnr-2000 dataset, and plotted it against the interval estimation
obtained by HyperANF (see Figure 7).

¹³ We remark that the "diameter 7 ~ 8" claim in [KTA+10] about the altavista dataset refers to the effective
diameter for the symmetrised version of the graph.

Figure 6: A plot showing the spid against the average distance using the same conventions of Figure 5.



Figure 7: Effective diameters at $\alpha$ for the cnr-2000 dataset; red bullets show the real effective
diameter, whereas green crosses show the upper and lower extremes of the confidence interval obtained
from 100 runs of HyperANF with m = 128.



7 Future work 

HyperANF lends itself naturally to distributed implementations. However, contrary to the
approach taken by HADI [KTA+10], we think that the correct parallel framework for implementing
a diffusing computation is a synchronous parallel system where computation happens at nodes and
communication is sent from node to node with messages. Such a framework, Pregel, has been
recently developed at Google [MAB+10]. In a Pregel implementation of HyperANF, every
computational node sends its own counter as a message to its predecessors if it changed from the previous
iteration, waits for incoming messages from its successors, and computes the maximisation
procedure on the received messages. Due to the small size of HyperLogLog counters (exponentially
smaller than the Flajolet-Martin counters used by ANF), the amount of communication would be
very small.
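The message-passing scheme just described can be simulated on a single machine with a few lines of Python. This is only a sketch under stated assumptions: it does not use the Pregel API, the function name is hypothetical, and counters are plain lists of integers standing in for HyperLogLog registers, merged by register-wise maximum.

```python
def run_supersteps(successors, counters):
    """Simulate the synchronous scheme (hypothetical helper).

    successors[x] lists the nodes y with an arc x -> y (every node must be
    a key); counters[x] is the initial list of registers of x. Returns the
    final counters and the number of supersteps executed.
    """
    counters = {x: list(c) for x, c in counters.items()}
    predecessors = {x: [] for x in successors}
    for x, succ in successors.items():
        for y in succ:
            predecessors[y].append(x)
    changed = set(counters)  # every node sends its counter at the start
    steps = 0
    while changed:
        # Communication: changed nodes send their counter to predecessors.
        inbox = {x: [] for x in successors}
        for y in changed:
            for x in predecessors[y]:
                inbox[x].append(counters[y])
        # Computation: register-wise maximum of own and received counters.
        updated = {}
        for x, messages in inbox.items():
            if messages:
                merged = [max(regs) for regs in zip(counters[x], *messages)]
                if merged != counters[x]:
                    updated[x] = merged
        counters.update(updated)
        changed = set(updated)
        steps += 1
    return counters, steps


# Directed path 0 -> 1 -> 2 with one-hot registers, so that the final
# counter of a node marks exactly the nodes reachable from it.
final, steps = run_supersteps(
    {0: [1], 1: [2], 2: []},
    {0: [1, 0, 0], 1: [0, 1, 0], 2: [0, 0, 1]},
)
```

Sending only the counters that changed is safe because the maximum is idempotent: a predecessor has already merged the last value it received from an unchanged successor.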

Although in this paper we preferred to focus on the computation of the spid, we remark that
HyperANF can also be used to build the radius distribution described in [KTA+10], or the related
closeness centrality.

8 Conclusions



HyperANF is a breakthrough improvement over the original ANF techniques, mainly because of the
usage of the more powerful HyperLogLog counters combined with their fast broadword combination
and systolic computation. HyperANF can run very large graphs to stabilisation, computing data
with statistical guarantees.

We consider, however, the introduction of the spid of a graph the main conceptual contribution
of this paper. HyperLogLog is instrumental in making the computation of the spid possible, as the
latter requires a number of iterations that is an order of magnitude larger than that required for
an estimate of the effective diameter.

Acknowledgements Flavio Chierichetti participated in the earlier phases of this work. We want
to thank Dario Malchiodi for fruitful discussions and hints.